Research
Human Detection of AI-Generated Phishing
An ongoing study into which phishing techniques humans miss most when AI partially standardizes linguistic quality across conditions. Data is collected through Threat Terminal, a game-based research platform where players classify emails, bet confidence, and earn XP while contributing to a real dataset.
Study Protocol
Altiparmak, S. (2026). Human Detection of AI-Generated Phishing: Study Protocol and Dataset Design for the Threat Terminal Experiment (v1.1). Zenodo.
doi.org/10.5281/zenodo.19156047 → Preprint (not peer-reviewed) · CC BY 4.0 · Published March 22, 2026
Why This Matters
Most published research on phishing detection focuses on automated filtering rather than human judgment. Security awareness training still teaches heuristics that assume poor writing quality, an assumption AI has already invalidated. The question practitioners actually need answered (which attack techniques bypass trained humans when the writing is no longer the tell) does not yet have good empirical data behind it.
That gap matters more now than it did two years ago. AI-generated phishing is compressing the skill gap between low-effort campaigns and targeted social engineering, and federal cybersecurity strategy is starting to reflect that shift. Financial services, where email-based attacks account for a disproportionate share of breaches, is particularly exposed. This study is designed to produce the kind of data that informs both training programs and detection strategy.
Research Question
The dominant heuristics for identifying phishing have historically been linguistic: look for grammar errors, awkward phrasing, unusual idioms, and formatting inconsistencies. These signals held up because real phishing campaigns were written sloppily, often by non-native speakers working at volume. That era is over. AI-generated phishing is grammatically flawless, contextually plausible, and available at negligible marginal cost.
The study question is: when linguistic quality is held constant across all emails, phishing and legitimate alike, which phishing techniques produce the lowest human detection rates?
Technique is the only independent variable. Every card in the dataset, phishing and legitimate, was generated by an AI model. This controls for writing quality and removes it as a confound. What remains are the structural and contextual properties of each technique: how it frames the request, what authority it invokes, what urgency it creates, and whether it establishes a plausible backstory.
Secondary questions: Does professional security background improve detection rates for specific techniques, or uniformly across all of them? Does overconfidence cluster around specific techniques? And do security professionals show meaningfully lower bypass rates than technical non-security users, or does security experience predict detection accuracy less well than commonly assumed?
Dataset
1,000
Total cards
690
Phishing cards
310
Legitimate cards
6
Techniques studied
Phishing cards (690)
Six techniques, 115 cards each. Each technique block is split into four difficulty tiers to ensure the dataset captures a realistic range of attack sophistication rather than clustering at a single difficulty level. Forensic metadata also varies by tier: easy and medium cards default to failed email authentication (SPF/DKIM/DMARC), while hard and extreme cards may present verified or ambiguous authentication status, removing header analysis as a reliable shortcut.
| Difficulty | Cards per technique |
|---|---|
| Easy | 35 |
| Medium | 35 |
| Hard | 35 |
| Extreme | 10 |
Legitimate cards (310)
Legitimate cards cover three real-world email categories. Including a realistic volume of legitimate emails ensures players cannot gain an edge by defaulting to phishing classifications, and that false positive rates are measurable.
| Category | Cards |
|---|---|
| Transactional (receipts, shipping, account updates) | 110 |
| Marketing (newsletters, promotions, announcements) | 100 |
| Workplace (internal comms, HR, IT notices) | 100 |
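The composition above can be written down as a small sanity check. The sketch below encodes the published card counts in plain Python (the names and structure are illustrative, not the platform's actual schema) and verifies that the tiers and categories sum to the 690/310/1,000 totals:

```python
# Illustrative encoding of the v1 dataset composition described above.
# Structure and names are hypothetical, not the platform's real schema.

TECHNIQUES = [
    "urgency", "authority_impersonation", "credential_harvest",
    "hyper_personalization", "pretexting", "fluent_prose",
]

# Cards per difficulty tier within each technique block.
TIER_COUNTS = {"easy": 35, "medium": 35, "hard": 35, "extreme": 10}

# Legitimate cards by real-world category.
LEGITIMATE = {"transactional": 110, "marketing": 100, "workplace": 100}

cards_per_technique = sum(TIER_COUNTS.values())          # 115
phishing_total = cards_per_technique * len(TECHNIQUES)   # 690
legitimate_total = sum(LEGITIMATE.values())              # 310

assert phishing_total == 690
assert legitimate_total == 310
assert phishing_total + legitimate_total == 1000
```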
The dataset is frozen at v1 once 1,000 approved cards are reached. All cards go through an admin review pipeline before going live: generated using two Claude models (approximately 80% Claude Haiku 4.5, 20% Claude Sonnet 4.6), with per-card model provenance recorded for post-hoc analysis of whether model choice affects detection rates. Cards are staged for review, then approved or rejected by a human reviewer. Cards are not added to the live dataset without review.
The platform includes a card reporting feature. Participants can flag any card during classification. A reported card is excluded from primary analysis only if review confirms one of three pre-registered criteria: incorrect ground truth, internal inconsistency between forensic metadata and email content, or a rendering or display error. Reports that do not meet these criteria do not result in exclusion.
Phishing Techniques Studied
Each technique represents a distinct social engineering mechanism. They were selected because they map to real-world attack patterns documented in threat intelligence reporting, and because they make different cognitive demands on the classifier. Technique is the only independent variable in the study. Everything else (prose quality, email structure, and presentation) is held constant.
Urgency
Emails that manufacture artificial time pressure to force fast, unconsidered decisions. These messages typically invoke expiring accounts, unprocessed payments, immediate action requirements, or looming security events. The tell is structural: the email wants you to act before you think. In a controlled quality environment, where the prose is polished and the framing is plausible, urgency becomes harder to isolate as a signal because it also appears in legitimate transactional emails: password resets, shipping updates, calendar reminders.
Authority Impersonation
Messages that impersonate a figure whose instructions carry implicit compliance pressure: executives, IT departments, HR, legal teams, government agencies, or established institutions. The attack exploits deference. Recipients are conditioned to respond to certain names and titles without scrutinising the request itself. In the dataset, all sender names and organisations are plausible rather than obviously spoofed, which removes the low-effort check of looking for misspelled brand names.
Credential Harvest
Classic credential phishing: an email directing the recipient to a login page, verification flow, or account recovery process. These messages are the backbone of most real-world phishing campaigns because they work. The dataset focuses on the email layer, not the destination. Cards present the message itself and reveal forensic signals (SPF/DKIM/DMARC status, reply-to analysis, URL characteristics) after the player classifies it. The goal is to test whether players can detect the phishing intent from the message alone, before they ever click.
Hyper-personalization
Emails that reference contextually plausible personal or professional detail to establish authenticity. These might reference a recent purchase, a shared connection, a project name, an industry, or a role-specific process. The technique exploits the cognitive shortcut of recognising familiar context as a legitimacy signal. Hyper-personalized phishing is expensive to produce at scale with human writers, but AI makes it trivially cheap. This category tests whether the presence of relevant-sounding context meaningfully lowers detection rates.
Pretexting
Multi-step social engineering that establishes a believable backstory before making the ask. The email arrives as part of an implied ongoing interaction: a follow-up to a meeting that may or may not have happened, a response to a request the recipient may or may not remember making, a continuation of a vendor relationship. The pretext does the work. The request itself is often mundane. Detection requires recognising the setup as artificial rather than evaluating the request on its own terms.
Fluent Prose
Phishing with no urgency cues, no authority figure, no personalization, and no pretext. Just polished, neutral email language making a request. This is the hardest category to classify because it removes every conventional heuristic simultaneously. The email reads like a normal business communication. The study hypothesis is that fluent prose phishing will have the highest bypass rate precisely because it offers nothing obvious to flag. If that hypothesis holds, it has significant implications for how security awareness training frames the "what to look for" question.
Methodology
Game modes
The platform requires account creation via email one-time password (no persistent password is stored). All participants begin in Research Mode, which contributes classified answers to the study dataset and is capped at 30 answers (three sessions of ten cards each). Sessions draw 10 cards via uniform random selection from each participant's remaining pool without stratification, so technique representation balances naturally at scale. After completing Research Mode, participants unlock Freeplay and Expert Mode. Freeplay uses separately generated cards that are not part of the research card pool. Freeplay data is not persisted to the study database.
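The session-dealing logic described above amounts to unstratified sampling without replacement, cut off at the 30-answer cap. A minimal sketch, with function and constant names that are assumptions rather than the platform's actual code:

```python
import random

SESSION_SIZE = 10          # cards per Research Mode session
MAX_RESEARCH_ANSWERS = 30  # three sessions, then graduation

def deal_session(remaining_pool: list[str], answered: int) -> list[str]:
    """Draw the next session's cards, or nothing once the cap is reached."""
    if answered >= MAX_RESEARCH_ANSWERS:
        return []  # participant has completed Research Mode
    # Uniform random selection, no stratification by technique or tier:
    # technique representation is left to balance naturally at scale.
    return random.sample(remaining_pool, k=min(SESSION_SIZE, len(remaining_pool)))

pool = [f"card_{i}" for i in range(1000)]
session = deal_session(pool, answered=0)
assert len(session) == 10 and len(set(session)) == 10
assert deal_session(pool, answered=30) == []
```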
Classification and confidence
For each card, players make two decisions: classification (phishing or legitimate) and confidence level. Confidence is expressed in three tiers:
| Level | Score multiplier | Interpretation |
|---|---|---|
| GUESSING | 1× | uncertain classification |
| LIKELY | 2× | moderate confidence |
| CERTAIN | 3× | high confidence |
Confidence data is recorded alongside correctness. This allows the study to measure calibration: whether players who report high confidence are actually more accurate, whether overconfidence clusters around specific techniques, and whether security professionals show better calibration than non-security users.
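Calibration in this sense reduces to accuracy conditioned on reported confidence: a well-calibrated player should be most accurate when CERTAIN and least accurate when GUESSING. A minimal sketch of that computation (data layout is hypothetical):

```python
from collections import defaultdict

# Score multipliers from the confidence table above.
MULTIPLIER = {"GUESSING": 1, "LIKELY": 2, "CERTAIN": 3}

def calibration_by_confidence(answers):
    """Accuracy per reported confidence tier.

    `answers` is an iterable of (confidence, correct) pairs. Divergence
    between reported confidence and accuracy indicates miscalibration.
    """
    totals = defaultdict(lambda: [0, 0])  # tier -> [correct, seen]
    for confidence, correct in answers:
        totals[confidence][0] += int(correct)
        totals[confidence][1] += 1
    return {tier: c / n for tier, (c, n) in totals.items()}

sample = [("CERTAIN", True), ("CERTAIN", False), ("GUESSING", True), ("LIKELY", True)]
print(calibration_by_confidence(sample))
# → {'CERTAIN': 0.5, 'GUESSING': 1.0, 'LIKELY': 1.0}
```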
Data collected per answer
Research Mode answers are linked to a pseudonymous player UUID. Email addresses are held only in Supabase Auth and are never stored in research tables. The research tables record:
- Player UUID (pseudonymous, not linked to email outside auth)
- Game mode (research mode only)
- Card technique, difficulty tier, and correct classification
- Player answer, confidence level (GUESSING / LIKELY / CERTAIN), and input method (button or keyboard)
- Three timing measurements: total time from card render to answer, confidence-to-answer time, and confidence deliberation time (all in milliseconds)
- Forensic interaction telemetry: scroll depth percentage and whether URLs were inspected (header inspection was collected prior to panel removal; see protocol change below)
- Session context: answer ordinal position, running correct-answer streak, cumulative correct count, and session identifiers
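The fields above can be pictured as one row per answer. The dataclass below is a hypothetical rendering of that record (the platform's actual column names may differ):

```python
from dataclasses import dataclass

@dataclass
class ResearchAnswer:
    """One Research Mode answer row, per the fields listed above.

    Field names are illustrative, not the platform's actual schema.
    """
    player_uuid: str                 # pseudonymous, never linked to email
    technique: str                   # e.g. "pretexting"
    difficulty: str                  # easy / medium / hard / extreme
    ground_truth: str                # "phishing" or "legitimate"
    answer: str                      # player classification
    confidence: str                  # GUESSING / LIKELY / CERTAIN
    input_method: str                # "button" or "keyboard"
    total_time_ms: int               # card render to answer
    confidence_to_answer_ms: int
    confidence_deliberation_ms: int
    scroll_depth_pct: float          # forensic interaction telemetry
    urls_inspected: bool
    answer_ordinal: int              # position within the session
    streak: int                      # running correct-answer streak
    cumulative_correct: int
    session_id: str

    @property
    def correct(self) -> bool:
        return self.answer == self.ground_truth
```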
Professional background
Players can optionally self-report their professional background on their profile. This field is used to compare bypass rates across groups and test whether security experience produces meaningfully better detection outcomes. The options are:
- INFOSEC / CYBERSECURITY: working in security
- TECHNICAL / NON-SECURITY: technical role outside security
- OTHER: general users, students, non-technical roles
- PREFER NOT TO SAY: excluded from group comparison analysis
Background is optional. Players who select “prefer not to say” are excluded from group comparison analysis while their answer data remains in the main dataset.
Forensic signals
After each round, players see a forensic signal breakdown for every card they classified. This serves a dual purpose: it functions as the learning layer of the game, and it trains players on real detection signals rather than just telling them the answer. The signals revealed are:
SPF / DKIM / DMARC (removed)
Authentication status for the sending domain was originally displayed during classification. The header inspection panel was removed mid-study after analysis showed it introduced a confound: participants with technical backgrounds used it as a shortcut rather than evaluating the email itself, which undermined the study's focus on technique-level detection. Authentication metadata remains in the dataset for post-hoc analysis. See the v1.1 addendum for full rationale.
Reply-To Mismatch
Whether the reply-to address differs from the from address. A common technique for harvesting replies without controlling the sending domain. Legitimate bulk email often uses separate reply-to addresses, so this requires contextual interpretation.
Send Timestamp Analysis
The time and timezone offset of the message. Emails sent at unusual hours or from unexpected timezone offsets can indicate automated sending infrastructure or a mismatch between the claimed organisation and the actual sender location.
URL Inspector
Tappable links that reveal destinations. Hovering or tapping a link in a real email is one of the most reliable quick checks available. The game simulates this to train the habit and to show players the gap between displayed anchor text and the actual URL.
Attachment Name Analysis
Where applicable, the filename and extension of attachments. Double extensions, unusual formats for the claimed document type, and names engineered to trigger opens are all represented in the dataset.
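The double-extension pattern mentioned above lends itself to a simple heuristic. A sketch, with extension lists that are illustrative rather than exhaustive:

```python
# Illustrative sketch of attachment-name heuristics; the extension
# sets below are examples, not a complete detection list.
SUSPICIOUS_FINAL_EXTS = {"exe", "scr", "js", "vbs", "bat", "cmd", "lnk"}
DOCUMENT_EXTS = {"pdf", "doc", "docx", "xls", "xlsx", "zip"}

def attachment_red_flags(filename: str) -> list[str]:
    """Return red flags for a filename, e.g. 'invoice.pdf.exe'."""
    parts = filename.lower().rsplit(".", 2)
    flags = []
    if len(parts) == 3 and parts[1] in DOCUMENT_EXTS:
        # A document extension hiding the real final extension.
        flags.append("double extension")
    if parts[-1] in SUSPICIOUS_FINAL_EXTS:
        flags.append("executable extension")
    return flags

assert attachment_red_flags("invoice.pdf.exe") == ["double extension", "executable extension"]
assert attachment_red_flags("Q3-report.xlsx") == []
```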
Protocol change: authentication header panel removal
The original platform design included an interactive panel where players could inspect email authentication headers (SPF, DKIM, DMARC status) during classification. This panel was removed mid-study as a deliberate protocol change, documented in the v1.1 addendum.
The rationale: the study is designed to measure human detection of phishing techniques, not header literacy. Early data showed that participants with technical backgrounds were using authentication status as a classification shortcut, bypassing the email content entirely. This created a confound where detection rates reflected header-reading ability rather than technique recognition. Removing the panel refocused the experiment on what it was designed to measure.
Authentication metadata is still stored in the dataset and remains available for post-hoc analysis. Answers collected before and after the panel removal are flagged in the data, allowing the change to be accounted for in analysis. The addendum documents the exact timing, affected data points, and analytical approach for handling the transition.
Expert Mode
After submitting 30 research answers (3 completed sessions), participants unlock Expert Mode. Expert Mode draws exclusively from separately generated extreme-tier cards (10 cards per technique, 60 cards total in the phishing pool) that are not part of the Research Mode pool, and awards double XP. Expert Mode answers are not included in the primary research dataset. Upon graduating to Expert Mode, participants can no longer contribute new data to the study; they continue to play for engagement and review personal statistics, but subsequent classifications are excluded from analysis. This design prevents experienced participants from skewing the research dataset with repeated exposure effects. Expert Mode engagement data is tracked separately for potential secondary analysis of skill progression.
Limitations
Self-selected sample
Participants are players who discovered Threat Terminal and opted into Research Mode. This is not a random sample of the general population. Results will over-represent people who are security-aware or curious about phishing, which likely biases detection rates upward compared to a general workforce sample.
Game context
Players know they are classifying emails in a game environment. This may produce different cognitive engagement than real-world email triage, where classification competes with other tasks and attention is not guaranteed. Game context may inflate detection rates by focusing attention on the task.
AI-generated cards
All cards, including legitimate ones, are AI-generated. This produces a controlled dataset but means the legitimate emails do not carry the full contextual richness of real correspondence. In practice, recipient-specific context (knowing the sender, expecting the email, recognising internal references) is a strong legitimacy signal that the dataset cannot replicate.
Self-reported background
Professional background is self-reported and not verified. Players may misclassify their background or select options that do not accurately reflect their day-to-day exposure to security concepts.
Hyper-personalization ceiling
Hyper-personalization cards use plausible contextual detail rather than genuine information about the specific participant. The study measures recognition of hyper-personalization as a technique structure, not the effectiveness of genuine personalization. In-study bypass rates for this technique will not reflect its real-world ceiling.
Card classification reliability
Each card is assigned a technique label and difficulty tier by the author alone. No second coder independently classified the cards, and no inter-rater reliability metric is reported. Technique assignment is determined at generation time by the prompt specification, and the admin review pipeline rejects cards that do not clearly instantiate the specified technique, but a formal reliability audit with an independent coder is planned prior to the empirical findings paper.
Within-study learning effects
Participants receive forensic signal breakdowns after each 10-card session, meaning classifications in sessions two and three are informed by feedback from prior sessions. This pedagogical feedback is integral to the platform but introduces a potential learning confound. The planned analysis includes session order as a covariate to estimate the magnitude of within-study learning.
Base rate awareness
The 69/31 phishing-to-legitimate ratio is a design choice to ensure statistical power and prevent gaming, not a reflection of real-world base rates where the overwhelming majority of messages are legitimate. Absolute detection rates should be interpreted in context rather than as estimates of real-world performance.
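The base-rate point can be made concrete with a quick Bayes calculation. The sensitivity and specificity values below are purely illustrative, not study results; the point is how much the positive predictive value moves when the same classifier operates at real-world base rates:

```python
def phishing_ppv(sensitivity: float, specificity: float, base_rate: float) -> float:
    """P(actually phishing | player says phishing), by Bayes' rule."""
    true_pos = sensitivity * base_rate
    false_pos = (1 - specificity) * (1 - base_rate)
    return true_pos / (true_pos + false_pos)

# Same hypothetical player, two environments:
in_study = phishing_ppv(0.80, 0.90, base_rate=0.69)    # 69% phishing, as in the dataset
real_world = phishing_ppv(0.80, 0.90, base_rate=0.01)  # ~1% phishing in a real inbox

print(f"{in_study:.2f} vs {real_world:.2f}")
# → 0.95 vs 0.07
```

Identical judgment, radically different practical meaning, which is why absolute in-study rates should not be read as real-world performance estimates.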
Single-family generation
The dataset comprises cards generated by two models from the Anthropic Claude family (Haiku 4.5 and Sonnet 4.6). While using two models introduces stylistic variation, both share underlying training characteristics. Per-card model provenance is recorded, enabling secondary analysis of whether generation model affects detection rates.
Hypotheses
These hypotheses were formed before data collection began and are stated here to distinguish predictions from post-hoc rationalisations once results are available.
Highest bypass rate among named techniques
Among the five named technique categories, Pretexting is expected to produce the highest bypass rate. Pretexting conceals malicious intent inside a plausible narrative; the request itself appears mundane, requiring detection of the surrounding setup rather than the content of the ask. Fluent Prose, which removes every conventional social engineering mechanism simultaneously, is expected to produce a high bypass rate as well, but serves a structurally distinct role as a baseline measuring detection failure when no identifiable trick is present. The empirically informative comparisons are the relative ordering among the five named techniques and the gap between those techniques and the Fluent Prose baseline.
Hyper-personalization deserves a separate note here. In real-world deployments, it would likely be the most effective technique of all: an email that references your actual name, role, recent activity, or known colleagues is substantially harder to dismiss than a generic message. Cards labelled hyper-personalization use plausible contextual detail rather than detail drawn from the specific player seeing the card. This is a deliberate scope boundary, not a flaw in the design. The study is measuring recognition of technique structure: can players identify that an email is attempting to exploit personal familiarity, regardless of whether it references their actual details? That is a separable question from whether real-world personalisation is effective, and it is the question this dataset is built to answer. In-study bypass rates for hyper-personalization will not reflect its real-world ceiling, but they were never intended to.
Lowest bypass rate
Credential Harvest is expected to be the most detectable technique. It is the attack pattern most consistently covered in security awareness training, and players are conditioned to scrutinise login prompts and link destinations more than any other element of an email. Even in a controlled environment where prose quality is held constant, the structural fingerprint of credential phishing is recognisable: there is always an ask to authenticate somewhere.
Group differences
Security professionals (INFOSEC group) are expected to outperform both technical non-security users and general users in overall detection rate. Daily exposure to threat patterns, incident reports, and phishing simulations should produce better intuition across most technique categories. The more interesting question is whether that advantage is uniform or concentrated: security professionals may show dramatically better detection on some techniques while performing comparably to other groups on techniques that exploit cognitive shortcuts rather than technical knowledge.
Confidence calibration
Players will be overconfident when wrong. Incorrect classifications are expected to skew toward LIKELY and CERTAIN rather than GUESSING, meaning players will not just miss phishing emails but will miss them while feeling sure they are right. This pattern is expected to cluster on techniques that produce the most plausible-looking output: pretexting and fluent prose. If a well-constructed pretext reads like a normal email, the player who misclassifies it has no signal telling them they should be uncertain. That confident wrongness is a meaningful finding in its own right, separate from raw bypass rates.
Status
Data collection ongoing. A findings report is in progress and will be published once the study reaches 100 participants.
Live findings are published at research.scottaltiparmak.com/intel and update in real time as data comes in. A formal write-up will be submitted for peer consideration as the dataset matures. If you are a researcher interested in collaborating, reviewing methodology, or discussing the data, get in touch.
Participate
All participants begin in Research Mode after creating a free account (email OTP, no password). Each session is 10 cards and takes about five minutes. After completing 30 research answers (three sessions), Freeplay and Expert Mode unlock.